home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Libris Britannia 4
/
science library(b).zip
/
science library(b)
/
INFO
/
CPUCMP14.ZIP
/
CPUCMP14.TXT
Wrap
Text File
|
1993-06-14
|
32KB
|
531 lines
Performance Comparison Intel 386DX, Intel RapidCAD, C&T 38600DX, Cyrix 486DLC
This document, containing descriptions and a benchmark comparison of the
Intel 386DX, Intel RapidCAD, C&T 38600DX, and the Cyrix 486DLC has been
put together for the benefit of the net.community. I believe, but cannot
guarantee, that all data given herein is correct. If you would like to
report errors in this document, or have suggestions for its enhancement,
feel free to contact me at the following email address:
JUFFA@IRA.UKA.DE
You can also write to me at my snail mail address:
Norbert Juffa
Wielandtstr. 14
7500 Karlsruhe 1
Germany
Corrections pertaining to spelling or grammatical errors are also encouraged,
especially from those net.users that are native speakers of (American) English.
This is the third version of this document. Thanks to the people that helped
improve it by their comments on the previous versions:
Alex Martelli (martelli@cadlab.sublink.org),
Fred Dunlap (fred@cyrix.com),
Antony Suter (antony@werple.apana.org.au)
A special thanks for editing this article goes to:
David Ruggiero (osiris@halcyon.halcyon.com)
The previous version of this article has also appeared in print. Check out:
Juffa, N.: "Performance Comparison Intel 386DX, Intel RapidCAD, C&T 38600DX,
Cyrix 486DLC". In: Association Francaise des Utilisateurs d'Unix et des
Systemes Ouverts (ed.): "Dossier special No 4 - Benchmarks", March 1993,
pp. 109-117
This document released 14-June-1993
---------------------------------------------------------------------------
In 1992, three new CPUs were introduced into the PC marketplace. Each has the
capability to replace the standard Intel 386DX CPU and provide higher
performance for existing 386 systems. These new chips are the Intel RapidCAD,
the Chips&Technologies (C&T) Super386 38600DX, and the Cyrix 486DLC. All chips
had initially been released in 33 MHz versions, but at least the Cyrix 486DLC
is available in a 40 MHz version now. While the Intel RapidCAD is marketed
only as an end-user product (which means PC manufacturers will not ship systems
with the RapidCAD already installed), C&T sells their products directly to the
end user as well as to PC motherboard manufacturers. Cyrix currently sells
only to OEMs (Original Equipment Manufacturers), that is motherboard
manufacturers. Cyrix also manufactures the 486SLC, which is a replacement for
the Intel/AMD 386SX CPUs. C&T has announced a 38600SX chip, but it will not
ship before sometime in 1993, if at all.
The Intel RapidCAD
------------------
The Intel RapidCAD is, in rough terms, an Intel 486DX with the 8 KB on-chip
cache removed and a 386 pinout. To software, it appears to be a 386DX with a
math coprocessor, as the 486-specific instructions have been removed from its
instruction set. It is marketed by Intel as "the ultimate coprocessor" and, as
the name implies, its main purpose is to replace the 386DX in existing systems
and thereby boost the performance of floating-point intensive applications such
as CAD software, spreadsheets, and math packages (e.g. SPSS, Mathematica).
RapidCAD is delivered as a set of two chips. RapidCAD-1 is a 132-pin PGA (pin
grid array) chip that goes into the CPU socket and replaces the 386DX. It
contains the CPU and the FPU (floating-point unit). RapidCAD-2 is a 68-pin PGA
chip that goes into the 387 coprocessor (or EMC) socket; it contains a simple
PLA whose only purpose it to generate the FERR signal for the motherboard
logic: this provides full PC-compatible handling of floating-point exceptions.
Many CPU instructions execute in one clock cycle in the RapidCAD, just as on
the Intel 486. The RapidCAD is constrained, however, by use of the standard
386DX bus interface, since every bus cycle takes two CPU clock cycles. This
means that instructions may be executed by the RapidCAD faster than they can be
fetched from memory. But as most FPU instructions take longer to execute than
CPU instructions, they are not greatly slowed down by this slow bus interface,
and most of them execute in the same time as in the Intel 486DX. Therefore, the
Intel RapidCAD provides higher overall floating-point performance than any
386/387 combination. Results from the SPEC benchmarks, a common workstation
UNIX benchmark suite, show that the RapidCAD provides 85% more floating-point
and 15% more integer performance than an Intel 386DX/387DX combination at the
same clock frequency.
Power consumption of the RapidCAD is typically 3500 mW at 33 MHz. The current
street price of the 33 MHz version is ~300 US$.
The C&T 38600DX
---------------
The Chips&Technologies 38600DX is designed to be 100% compatible to the Intel
386DX CPU. Unlike AMD's Am386 CPUs (which use microcode that is identical to
the Intel 386DX's microcode), the C&T 38600DX uses new microcode developed
within C&T using "clean-room" techniques. C&T even included the 386DX's
"undocumented" LOADALL386 instruction in the instruction set to provide full
compatibility with the Intel chip.
Some instructions execute faster on the 38600DX than on the 386DX. C&T also
produces a 38605DX CPU that includes a 512-byte instruction cache and provides
further performance increases over the Intel part. The 38605DX needs a bigger
socket (144-pin PGA) and is therefore *not* pin-compatible with the 386DX.
In my tests, I found that the 38600DX has severe problems with the CPU-
coprocessor communication, which causes its floating-point performance to drop
below the level provided by the Intel 386DX/Intel 387DX for most programs. This
problem exists with all available 387 compatible coprocessors (ULSI 83C87, IIT
3C87, Cyrix EMC87, Cyrix 83D87, Cyrix 387+, C&T 38700, and the Intel 387DX).
(A net.acquaintance also did tests with the 38600DX and arrived at similar
results. He contacted C&T and they said they were aware of the problem.)
Typical power consumption of the 40 MHz 38600DX is 1650 mW, which is less than
the typical power consumption of the Intel 386DX at 33 MHz (2000 mW). The
38600DX currently sells for ~US$ 80 for the 33 MHz version.
The Cyrix 486DLC
----------------
The Cyrix 486DLC is the latest entry into the market of 386DX replacements. It
features a 486SX compatible instruction set, a 1 KB on-chip cache, and a 16x16-
bit hardware multiplier. The RISC-like execution unit of the 486DLC executes
many instructions in a single clock cycle. The hardware multiplier multiplies
16-bit quantities in 3 clock cycles, as compared to 12-25 cycles on a standard
Intel 386DX. This is especially useful in address calculations (code from non-
optimizing compilers may contain many MUL instructions for array accesses) and
for software floating-point arithmetic (e.g., a coprocessor emulator). The 1 KB
cache helps the 486DLC to overcome the limitations of the 386 bus interface,
although its hit rate averages only about 65% under normal program conditions.
In existing 386 systems, DMA transfers (such as those performed by a SCSI
controller or a sound board) may force the internal cache of the 486DLC to be
flushed as the only means available in a 386 system to enforce consistency
between the contents of the on-chip cache and external memory. This stems from
the fact that the 386 bus interface was designed without provisions for an
on-chip cache. This problem can reduce the performance of the 486DLC in systems
that do a sizeable amount of (bus master) DMA. Cyrix has, however, defined some
additional cache control signals for some of the 486DLC lines; they can be used
to improve communications between the on-chip cache and a larger external cache
or main memory and prevent this problem. Current 386 systems ignore these
signals since they are not defined for the Intel 386DX. However, motherboards
designed for use with the Cyrix chip should be able to take advantage of them
and thereby gain increased performance.
The Cyrix cache is a unified data/instruction write-through type and can be
configured as either a direct-mapped or 2-way set associative cache. It permits
definition of up to four non-cacheable regions, which are particularly useful
if a system has memory mapped peripherals (e.g., the Weitek math coprocessor).
For compatibility reasons, the cache is disabled after a processor reset
and must be enabled with the help of a small program provided by Cyrix. This
prevents problems with BIOSes that are not "486DLC aware". Motherboards that
come with the 486DLC installed usually feature a BIOS that supports the cache
and switch it on after the machine self-test. On these machines, no additional
software is required to take advantage of the cache. The cache control program
from Cyrix is a DOS program. Antony Suter (antony@werple.apana.org.au) informs
me that public domain drivers for OS/2, 386BSD, and Linux that enable the
cache are available from various ftp servers.
Not all 486DLCs will work correctly with all math coprocessors in all
circumstances, with protected mode multitasked environments (e.g. MS-Windows
386-enhanced mode) being especially critical. Using the 486DLC with the Cyrix
EMC87, Cyrix 83D87 (chips produced prior to November 1991), and IIT 3C87, I
have been able to completely lock up the machine due to synchronization
problems between the CPU and the coprocessor while executing the FSAVE or
FRSTOR instructions (which are used to save and restore the coprocessor status
during task switches). According to Cyrix, this problem only occurred with the
first revision of the 486DLC (like the sample chip I have) and is fixed on
newer ones. To be on the safe side, the 486DLC should best be used with the
Cyrix 387+ (its "Europe-only" name) or with the identical Cyrix 83D87
(US-bound chips manufactured after October 1991): these are not only the
highest performing 387 coprocessors on the market, but they also work properly
even with the first generation 486DLC.
If you already have a Cyrix 83D87 coprocessor and want to know whether it is
the old or new type, I recommend you use my COMPTEST program, available as
CTEST259.ZIP via anonymous ftp from garbo.uwasa.fi and other fine ftp servers.
If COMPTEST reports a 387+, you either have the 387+ or the identical newer
version of the 83D87 installed and can use any version of the Cyrix 486DLC
without problems. If you believe that you may have problems with a 486DLC/387
combination, I suggest you contact Cyrix technical support (1-800-FAS-MATH in
the US).
Power consumption of a 40 MHz 486DLC is typically 2800 mW. As Cyrix can't fill
demand for their 486DLC/SLC chips, they currently sell these chips only to
OEMs. About two million 486SLC/DLC CPUs have been sold so far. The 486SLC is
especially popular with notebook manufacturers. Chips sold by retailers seem
to be "surplus" chips sold off by OEMs at a premium price. While the price for
a 486DLC-40 is about US$ 100 when bought by an OEM, it sells for about US$
200-250 on the open market.
Tests and benchmarks
--------------------
HW configuration: 33.3/40 MHz motherboard with Forex chip set and AMI BIOS. 128
KB zero- wait-state, direct mapped, write-through CPU cache with one write
buffer, 4 bytes per cache line, and 4 clock cycles penalty for a cache line
miss. 8 MB of main memory with an average access time of 1.6 wait states. Cyrix
EMC87 in 387 compatibility mode as math coprocessor. (This and the Cyrix 83D87
/ 387+ are the fastest coprocessors available for use with the
386DX/486DLC/38600DX). Conner 3204F hard disk, 203 MB capacity, IDE interface
(CORETEST throughput 1100 KB/s, seek time 16 ms). Diamond SpeedSTAR HiColor,
ISA bus SVGA card using Tseng's ET4000 chip, 1 MB DRAM as frame buffer, *no*
accelerator. The switches on the card were set for fastest reliable operation,
with a VIDSPEED throughput of 6500 bytes/ms at 40 MHz and 5400 bytes/ms at
33.3 MHz.
SW configuration: MS-DOS 5.0, MS Windows 3.1, HyperDisk 4.32 disk cache program
in write-back mode, using 2 MB of extended memory, 386MAX 6.01 used as memory
manager and DPMI provider in some benchmarks. Latest Tseng (Colorview) driver
for Windows 3.1 at 1024x768x256, using the 8514 fonts.
For the Whetstone, Dhrystone, WINTACH, DODUC, LINPACK, LLL, and Savage
benchmarks, *higher* numbers indicate *faster* performance.
For the MAKE RTL, MAKE TRANCK, and String-Test benchmarks, *lower* numbers
indicate *faster* performance.
Intel C&T Intel Cyrix Cyrix
33.3 MHz 386DX 38600DX RapidCAD 486DLC 486DLC
cache off cache on
integer
Whetstone [kWhets/s] 447 585 563 695 803
Dhrystone (C) [Dhry./s] 11688 11819 12357 14150 15488
Dhrystone (Pas) [Dhry./s]10455 10877 10751 12154 13858
String-Test [ms] 459 453 441 347 327
MAKE RTL [s] 51.32 47.10 46.34 43.45 39.13
MAKE TRANCK [s] 62.42 55.47 55.37 53.64 46.12
WINTACH [overall RPM] 4.85 4.90 5.49 5.53 6.14
float
DODUC [Rapidity index] 79.0 76.4 150.3 89.4 90.7
LINPACK [MFLOPS] 0.2808 0.2707 0.4578 0.3158 0.3438
LLL [MFLOPS] 0.3352 0.3537 0.6083 0.3816 0.4139
Whetstone [kWhets/s] 2540 2340 3990 2908 3061
Savage [function eval/s] 71685 53191 72464 88757 93897
Intel C&T Intel Cyrix Cyrix
40.0 MHz 386DX 38600DX RapidCAD 486DLC 486DLC
cache off cache on
integer
Whetstone [kWhets/s] 536 702 676 835 963
Dhrystone (C) [Dhry./s] 14128 14116 14836 16987 18750
Dhrystone (Pas) [Dhry./s]12490 13067 12890 14573 16624
String-Test [ms] 384 377 368 289 273
MAKE RTL [s] 43.46 40.11 39.84 37.25 33.54
MAKE TRANCK [s] 53.00 47.59 47.07 45.36 39.00
WINTACH [overall RPM] 5.65 5.73 6.41 6.46 7.23
float
DODUC [Rapidity index] 94.9 77.5 180.3 105.1 106.6
LINPACK [MFLOPS] 0.3324 0.3260 0.5418 0.3789 0.4131
LLL [MFLOPS] 0.4025 0.4204 0.7263 0.4562 0.4956
Whetstone [kWhets/s] 3061 2632 4798 3505 3677
Savage [function eval/s] 86083 49587 86957 106762 112360
To complete the picture, I ran the CPU/FPU standard benchmarks on an Intel
486DX running at 33.3/40 MHz. Since the 486 machine was configured with a
different hard disk than the 386 system, and the compilers and tools installed
on the 386 machine were not present, the MAKE benchmarks could unfortunately
not be included in the tests.
486DX, 256 KBytes CPU cache (write-through), 8 MB of RAM, AMI-BIOS, Diamond
SpeedSTAR HiColor (VIDSPEED throughput: 6500 bytes/ms at 40 MHz and 5400
bytes/ms at 33.3 MHz), MS-DOS 5.0, MS-Windows 3.1:
integer 33.3 MHz 40 MHz
Whetstone [kWhets/s] 707 848
Dhrystone (C) [Dhry./s] 19394 23265
Dhrystone (Pas) [Dhry./s] 16978 20368
String-Test [ms] 333 279
WINTACH 8.59 10.14
float
DODUC [Rapidity index] 184.0 220.7
LINPACK [MFLOPS] 0.6682 0.8204
LLL [MFLOPS] 0.9387 1.1110
Whetstone [kWhets/s] 5143 6195
Savage [function eval/s] 82192 98522
Conclusions
-----------
The Cyrix 486DLC is the 386DX replacement with the highest integer performance.
With the internal cache enabled, integer performance of the 486DLC can be up to
80% higher compared with a Intel 386DX at the same clock frequency, with the
average speed gain for integer applications being 35%. Enabling the internal
cache provides about 5-15% more performance than with the cache disabled for
both integer and floating-point applications. Floating-point applications are
accelerated by about 15%-30% if the Cyrix 486DLC (with cache enabled) is used
instead of the Intel 386DX. Compared with the Intel 486DX, the Cyrix 486DLC
provides about 70% of the integer performance and about 50% of the floating-
point performance at the same clock frequency.
The Intel RapidCAD is the 386DX replacement that provides the highest floating-
point performance. It can speed up most floating-point intensive programs by
60%-90% compared with the fastest Intel 386DX/math coprocessor combination; it
provides nearly 75% of the floating-point performance of a Intel 486DX at the
same clock frequency. Integer performance increases by an average 15% by using
the RapidCAD instead of the standard Intel 386DX, with the maximum performance
gain being 35%.
The Chips&Technologies 38600DX has a slightly higher integer performance than
the Intel 386DX, with the speedup ranging from 0%-30% and an average speedup on
the order of 10%.
Description of benchmarks
-------------------------
DHRYSTONE [9] is a synthetic benchmark developed by R. Weicker from Siemens in
1984. The frequency of operations and data types used by Dhrystone are modeled
after statistics collected for 'typical' programs that are written in a HLL
(high level language) such as C or Pascal that do not use floating-point
arithmetic. Thus there is a certain distribution of global and local variables
being used in procedures, there is a certain percentage of use of records
(structs) or strings out of the total number of variables accessed and there is
a certain percentage of procedure calls and if-statements out of the total
lines of codes executed. All these percentages match the statistics used in the
development of Dhrystone quite closely. To preserve the typical distributions,
the measurement rules for Dhrystone forbid function inlining (a frequently used
optimization technique optionally performed by most optimizing compilers). The
current version of Dhrystone is 2.1, and this is the version used for my tests.
Version 2.1 [10] differs from previous versions in that it contains additional
code that prevents optimizing compilers from throwing out most of the code by
dead code elimination (since Dhrystone has no input file, many expressions can
be computed at compile time), thus artificially inflating performance numbers.
The Dhrystone benchmark exists in equivalent Ada, Pascal and C versions, which
are all available from netlib@ornl.gov. Dhrystone measures the time to execute
its main loop and sets this time into relation with a fixed reference time, the
result being a performance index given in "Dhrystones per second". I used two
Dhrystone executables. One was compiled using the non-optimizing Turbo Pascal
6.0 compiler, the other one was compiled with the MS-C 7.0 compiler using the
large memory model and maximum optimization but without the use of function
inlining. Because the Turbo Pascal executable uses more memory operands and the
MS-C executable uses more register-to-register operations, the speedup for the
executables on different CPUs can be noticeably different.
WINTACH is a public domain benchmark program by Texas Instruments that measures
the speed of graphics output for four typical MS-Windows applications: a word
processor, a spreadsheet, a CAD program, and a paint program. TI has collected
profiles for each class of applications under 'typical' user loads. The four
parts of the WINTACH benchmark were then modeled after these profiles, so that
the percentage of certain GDI (graphics device interface) calls found in the
profiles is also present in the benchmark. WINTACH determines the performance
of each program part relative to the performance of a standard VGA card in a 20
MHz 386DX machine. It also combines the four values into an overall relative
performance index, which is reported in my tables. Using this single number to
represent graphics performance is justified in this case since the four partial
results do *not* deviate much from the average for the SVGA tested (a Diamond
SpeedSTAR HiColor). I used a frame buffer card for the test because for these
graphics card type, graphics performance depends solely on the performance of
the CPU, which handles all accesses to the display memory. On an accelerated
card, some of the low-level operations are delegated to the accelerator chip
and the influence of the CPU speed on the graphics performance is not seen so
clearly. For this test, I used the latest Tseng Windows 3.1 driver (Colorview)
at a resolution of 1024x768x256 with the large 8514 fonts (WINTACH code C8).
MAKE RTL is not a single program. Rather it is a build of a complete project,
my own run-time library for Turbo Pascal 6.0. The project consists of about 200
source files with a total size of ~650 KByte. Most of the source files are
assembly language (.ASM) files, while there are also some Pascal (.PAS) files.
Building the complete project using MAKE, about 200 binary files (.OBJ and
.TPU) are produced with a combined size of ~300 KB. The programs used during
the build are MAKE 3.5, TPC 6.01, and TASM 2.01, all by Borland, Inc. All files
are read from and written to the hard disk, using a 2 MB write back disk cache
provided by the HyperDisk 4.32 program. This eliminates most of the I/O
overhead. No memory manager was installed during the test. The time reported is
the time from starting the MAKE utility to the reappearance of the DOS prompt.
MAKE RTL can be thought of being typical of applications that operate on a lot
of small files, and use only integer instructions.
MAKE TRANCK is a project build for a project consisting of two assembly
language modules, two C modules and one FORTRAN module. The combined size of
the source files is approximately 120 KBytes. All modules are compiled
(assembled) and combined into a single program linking with both, the C and the
FORTRAN libraries. Programs used are MAKE 3.5 by Borland and MASM 6.0, MS-C
7.0, MS-FORTRAN 5.0, and LINK 5.3, all by Microsoft, Inc. The C compiler and
the linker run in protected mode and require a DPMI host to be present. 386MAX
6.01 by Qualitas was installed for this purpose. To minimize I/O overhead, the
HyperDisk 4.32 disk caching program was installed, with 2 MByte of extended
memory allocated to the cache. The modules are compiled with compiler switches
set for maximum optimization, and most of the time for the project build is
spent in the optimizing stages of the compilers.
STRING-TEST is a simple benchmark that tests the string handling functions
built into the Turbo Pascal language. Nearly all of the execution time is spent
in repeated execution of the STOS, CMPS, SCAS, and MOVS instructions. The data
and code fit into about five KBytes of memory. In cached systems, almost all
memory accesses will be cache hits. The program was written and compiled with
Turbo Pascal 6.0 and linked with my own run-time library, which provides much
faster string functions than the run-time library delivered by Borland.
Compiler switches were set for fastest execution. The time reported is the time
to complete the whole benchmark in milliseconds, with an accuracy of +/-1
millisecond.
LLL is short for Lawrence Livermore Loops [8], a set of kernels taken from real
floating-point extensive programs. Some of these loops are vectorizable, but
since we don't deal with vector processors here, this doesn't matter. For this
test, LLL was adapted from the FORTRAN original in [7] to Turbo Pascal. By
variable overlaying (similar to FORTRAN's EQUIVALENCE statement) memory
allocation for data was reduced to 64 KB, so all data fits into a single 64 KB
segment. The older version of LLL is used here which contains 14 loops. There
also exists a newer, more elaborate version consisting of 24 kernels. The
kernels in LLL exercise only multiplication and addition. The MFLOPS rate
reported is the average of the MFLOPS rate of all 14 kernels as reported by the
LLL program. All floating-point variables in the program are of type double.
LLL and Whetstone results (see below) are reported as returned by my COMPTEST
test program in which they have been included as a measure of coprocessor/FPU
performance. COMPTEST has been compiled under Turbo Pascal 6.0 with all
'optimizations' on and using my own run-time library, which gives higher
performance than the one included with TP 6.0. My library is available as
TPL60N19.ZIP from garbo.uwasa.fi and ftp-sites that mirror this site.
LINPACK [4] is a well known floating-point benchmark that also heavily
exercises the memory system. Linpack operates on large matrices and takes up
about 570 KB in the version used for this test. This is about the largest
program size a pure DOS system can accommodate. Linpack was originally designed
to estimate performance of BLAS, a library of FORTRAN subroutines that handles
various vector and matrix operations. Note that vendors are free to supply
optimized (e.g. assembly language versions) of BLAS. Linpack uses two routines
from BLAS which are thought to be typical of the matrix operations used by
BLAS. Both routines only use addition/subtraction and multiplication. The
FORTRAN source code for Linpack can be obtained from the automated mail server
netlib@ornl.gov. Linpack was compiled using MS FORTRAN 5.0 in the HUGE memory
model (which can handle data structures larger than 64 KB) and with compiler
switches set for maximum optimization. All floating-point variables in the
program are of the DOUBLE type. Linpack performs the same test repeatedly. The
number reported is the maximum MFLOPS rate returned by Linpack. Linpack MFLOPS
ratings for a great number of machines are contained in [5]. This PostScript
document is also available from netlib@ornl.gov.
WHETSTONE [1,2,3] is a synthetic benchmark based on statistics collected about
the use of certain control and data structures in programs written in high
level languages. Based on these statistics, Whetstone tries to mirror a
'typical' HLL program. Whetstone performance is expressed by how many
hypothetical 'whetstone' instructions are executed per second. It was originally
implemented in ALGOL. Unlike LLL and Linpack, Whetstone not only uses addition
and multiplication but exercises all basic arithmetic operations as well as
some transcendental functions. Whetstone performance depends on the speed of
the coprocessor as well as on the speed of the CPU, while LLL and Linpack place
a heavier burden on the coprocessor/FPU. There exist an old and a new version
of Whetstone. Note that results from the two versions can differ by as much as
20% for the same test configuration. For this test, the new version in Pascal
from [2] was used. It was compiled with Turbo Pascal 6.0 and my own library
(see above) with all 'optimizations' on. For the integer test, software
floating-point arithmetic using the REAL type was utilized. Using the software
arithmetic exercises only the CPU, in particular the execution of MOV, SHL, SHR,
RCR, ADD, SUB, MUL, and DIV instructions. For the floating-point test, the
hardware floating-point arithmetic of the coprocessor was used and computations
were performed using the DOUBLE type.
SAVAGE tests the performance of transcendental function evaluation. It is
basically a small loop in which the sin, cos, arctan, ln, exp, and sqrt
functions are combined in a single expression. While sin, cos, arctan, and sqrt
can be evaluated directly with a single 387 coprocessor instruction each, ln
and exp need additional preprocessing for argument reduction and result
conversion. According to [11], the Savage benchmark was devised by Bill Savage,
and is distributed by: The Wohl Engine Company, Ltd., 8200 Shore Front Parkway,
Rockaway Beach, NY 11693, USA. Usually, Savage is programmed to make 250,000
passes though the loop. Here only 10,000 loops are executed for a total of
60,000 transcendental function evaluations. The result is expressed in function
evaluations per second. SAVAGE source code was taken from [6] and compiled with
Turbo Pascal 6.0 and my own run-time library (see above).
DODUC [12] is a modified application program by Nhuan Doduc that is also part
of the SPEC benchmark suite. It is a nuclear safety analysis code that
simulates the time evolution of a thermohydraulic modelization for a nuclear
reactor's component and the benchmark is created from the computational kernel
of the original application. The benchmark consists of 5323 FORTRAN statement
lines and the executable size is about 350 KBytes. Almost all processing is
done using double precision floating-point values. There is not much array
processing. Instead the program has an iterative structure with an abundance of
short branches and small loops. The version used for this test was compiled
with the highly optimizing NDP FORTRAN V3.0 compiler from Microway, which
generates a 32-bit mode executable that runs in protected mode using a
protected mode loader provided by Microway. The execution time of the DODUC
program is set into relation with fixed reference times and the result is
scaled to give a "Rapidity index".
References
----------
[1] Curnow, H.J.; Wichmann, B.A.: A synthetic benchmark.
Computer Journal, Vol. 19, No. 1, 1976, pp. 43-49
[2] Wichmann, B.A.: Validation code for the Whetstone benchmark.
NPL Report DITC 107/88, National Physics Laboratory, UK, March 1988
[3] Curnow, H.J.: Wither Whetstone? The Synthetic Benchmark after 15 Years.
In: Aad van der Steen (ed.): Evaluating Supercomputers.
London: Chapman and Hall 1990
[4] Dongarra, J.J.: The Linpack Benchmark: An Explanation.
In: Aad van der Steen (ed.): Evaluating Supercomputers.
London: Chapman and Hall 1990
[5] Dongarra, J.J.: Performance of Various Computers Using Standard Linear
Equations Software.
Report CS-89-85, Computer Science Department, University of Tennessee,
March 11, 1992
[6] Huth, N.: Dichtung und Wahrheit oder Datenblatt und Test.
Design & Elektronik 1990, Heft 13, Seiten 105-110
[7] Esser, R.; Kremer, F.; Schmidt, W.G.: Testrechnungen auf der IBM 3090E mit
Vektoreinrichtung.
Arbeitsbericht RRZK-8803, Regionales Rechenzentrum an der Universit"at zu
K"oln, Februar 1988
[8] McMahon, H.H.: The Livermore Fortran Kernels: A test of the numerical
performance range.
Technical Report UCRL-53745, Lawrence Livermore National
Laboratory, USA, December 1986
[9] Weicker, R.P.: Dhrystone: A Synthetic Systems Programming Benchmark.
Communications of the ACM, Vol. 27, No. 10, October 1984, pp. 1013-1030
[10] Weicker, R.P.: Dhrystone Benchmark: Rationale for Version 2 and
Measurement Rules.
SIGPLAN Notices, Vol. 23, No. 8, August 1988, pp. 49-62
[11] FasMath 83D87 Benchmark Report. Cyrix Corporation, June 1990
Order No. B2004
[12] Doduc, Nhuan: Fortran Execution Time Benchmark.
Unpublished manuscript, version 45, available from ndoduc@framentec.fr